Package com.doclinx.ftxml

Provides all classes that comprise the TeraXML™ full-text XML retrieval and search system.


Class Summary
ALL_PARMS Parameter block class is shared between merge and index build steps.
AppParms Abstract callback class for passing application specific parameters on a file-by-file basis.
CatalogItem This class represents a Catalog information block or entry.
CatalogManager This class is the primary API of the TeraXML full-text search build and management package.
CatalogSearch This class is the API for the search and retrieval component of TeraXML.
DPAPI_PARMS Parameter block class that controls optional functionality during the index build phase.
InputCallback Extendable callback class that enables an application to control opening the input source.
MERGE_PARMS Parameter block class that controls optional functionality during the merging of two database indexes and files.
MXSSearch This class permits sequential searching of multiple catalogs with the same query.
SRC2STF_PARMS Parameter block class that controls optional functionality during the text parsing phase.
 

Exception Summary
CatalogMgrException The exception thrown by the CatalogManager class.
CatalogSearchException The exception thrown by the CatalogSearch class.
 

Package com.doclinx.ftxml Description

Provides all classes that comprise the TeraXML™ full-text XML retrieval and search system. The system uses a catalog metaphor: a catalog is a collection of related documents that can be searched using a Boolean query language. The system is designed to handle very large amounts of data, in the range of terabytes. The Java version supports XML and HTML input data.

Applications instantiate two primary classes to build a catalog and then to access, search, and update the collection. The CatalogManager class controls the creation and management of a catalog. A catalog is associated with a directory containing the files and data structures that maintain a searchable collection of documents. The CatalogSearch class provides access to the main catalog data structures and supports the formulation of queries that search for specific information within a catalog.

A catalog is created with the CatalogManager object. Creating a catalog requires a root directory where definition and template files are located. These files are copied to the catalog directory and are used to control some aspects of the indexing process. The copied files are generally not modified by applications. Typically, a catalog is created with a large body of documents and is then added to over time. For instance, a website is crawled, and afterwards only the files from that site that are new or have changed are added to the catalog collection. Once a catalog is created, it can be opened for searching or for updates. A catalog consists of a primary data store and an update data store. The primary database should contain the majority of the documents. As smaller sets of documents are added, they are placed in the update database. The two databases can be merged as efficiency requirements dictate. As a general rule of thumb, the update database should not be larger than 10% of the primary database.
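
The 10% rule of thumb above can be expressed as a small check that an application might run before deciding whether to merge. This is only a sketch: MergeAdvisor and shouldMerge are hypothetical names that are not part of com.doclinx.ftxml, and the sizes are assumed to be supplied by the application in whatever unit it tracks (bytes or document counts, for example).

```java
// Hypothetical helper, NOT part of com.doclinx.ftxml.
// It only encodes the "update database <= 10% of primary" rule of thumb.
final class MergeAdvisor {

    private MergeAdvisor() {
        // static helper, no instances
    }

    /**
     * Returns true when the update database is larger than 10% of the
     * primary database. Both sizes must use the same unit (an assumption:
     * the package description does not say how size is measured).
     */
    static boolean shouldMerge(long primarySize, long updateSize) {
        if (primarySize < 0 || updateSize < 0) {
            throw new IllegalArgumentException("sizes must be non-negative");
        }
        // For non-negative integers, updateSize > primarySize / 10 is
        // equivalent to updateSize > 0.10 * primarySize, and the division
        // form cannot overflow even for very large sizes.
        return updateSize > primarySize / 10;
    }

    public static void main(String[] args) {
        System.out.println(shouldMerge(1000, 100)); // exactly 10%: false
        System.out.println(shouldMerge(1000, 101)); // just over 10%: true
    }
}
```

The threshold is a guideline rather than a hard limit, so an application may reasonably choose a different ratio or merge on a schedule instead.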

A database is built by adding files to the catalog. Once the desired collection of documents has been added, the catalog is "updated". Any other CatalogManager objects accessing the same catalog are updated when the operation completes. Files can be added one at a time, by directory specification, or from a map or list file. A map file is created by a web crawl. Adding files causes them to be scanned for information, which is stored in the catalog. The information collected includes the document type, abstract, title, and tag context.
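
When a catalog is maintained incrementally from repeated crawls, the application has to decide which files to add each time. The sketch below shows one way to pick only the new or changed files by comparing last-modified timestamps. ChangedFileSelector, its method, and the use of timestamp maps are assumptions made for illustration; they are not part of com.doclinx.ftxml and do not reflect the actual map file format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, NOT part of com.doclinx.ftxml.
final class ChangedFileSelector {

    private ChangedFileSelector() {
        // static helper, no instances
    }

    /**
     * Returns the paths that are new or changed since they were last added.
     *
     * @param indexed path -> last-modified time recorded when the file was
     *                last added to the catalog (kept by the application)
     * @param crawled path -> last-modified time seen in the latest crawl
     * @return paths to add, in sorted order
     */
    static List<String> selectNewOrChanged(Map<String, Long> indexed,
                                           Map<String, Long> crawled) {
        List<String> toAdd = new ArrayList<>();
        // TreeMap gives a deterministic (sorted) iteration order.
        for (Map.Entry<String, Long> entry : new TreeMap<>(crawled).entrySet()) {
            Long previous = indexed.get(entry.getKey());
            boolean isNew = previous == null;
            if (isNew || entry.getValue() > previous) {
                toAdd.add(entry.getKey());
            }
        }
        return toAdd;
    }
}
```

The resulting list could then be handed to whichever add operation the application uses (single file, directory, or list file), followed by a catalog update.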

Searching a catalog is done with the CatalogSearch class. Typically, an application creates a CatalogSearch object and opens it on an existing catalog. A search is performed by submitting a query and then retrieving the results (if the query yields a set of matching documents). The query language defines a full set of Boolean operations and also includes syntax for XPath specifications, which provides context-sensitive search on databases composed of XML documents. The documents in the result list can then be used as keys to access the catalog and find the auxiliary information associated with each "found" document (e.g. title, abstract, etc.). Search and catalog access are designed to be very efficient and independent of database size.
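
The two-step pattern described above (run a Boolean query, then use the resulting document keys to look up auxiliary information) can be illustrated with a toy in-memory model. ToyCatalog is a stand-in written for this illustration only: it is not the CatalogSearch API, it does not implement the TeraXML query language or XPath support, and every name in it is hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model, NOT part of com.doclinx.ftxml.
final class ToyCatalog {
    // term -> keys of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document key -> auxiliary information (only the title here)
    private final Map<Integer, String> titles = new HashMap<>();

    void add(int docKey, String title, String text) {
        titles.put(docKey, title);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docKey);
            }
        }
    }

    private Set<Integer> docsFor(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                                     Collections.<Integer>emptySet());
    }

    // Boolean AND: documents containing both terms.
    Set<Integer> and(String a, String b) {
        Set<Integer> result = new TreeSet<>(docsFor(a));
        result.retainAll(docsFor(b));
        return result;
    }

    // Boolean OR: documents containing either term.
    Set<Integer> or(String a, String b) {
        Set<Integer> result = new TreeSet<>(docsFor(a));
        result.addAll(docsFor(b));
        return result;
    }

    // Boolean AND NOT: documents containing a but not b.
    Set<Integer> andNot(String a, String b) {
        Set<Integer> result = new TreeSet<>(docsFor(a));
        result.removeAll(docsFor(b));
        return result;
    }

    // Second step: use a result key to fetch auxiliary information.
    String titleOf(int docKey) {
        return titles.get(docKey);
    }
}
```

An application would iterate over the returned keys and call titleOf for each one, which mirrors how a result list is used to retrieve the title or abstract of each matching document.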

See Also:
CatalogManager for building a catalog, CatalogSearch for searching and accessing catalog information.